Assignment 3 (Sections 21 & 22)

Instructions

  1. You may talk to a friend, discuss the questions and potential directions for solving them. However, you need to write your own solutions and code separately, and not as a group activity.

  2. Write your code in the Code cells and your answer in the Markdown cells of the Jupyter notebook. Ensure that the solution is written neatly enough to understand and grade.

  3. Use Quarto to print the .ipynb file as HTML. You will need to open the command prompt, navigate to the directory containing the file, and use the command: quarto render filename.ipynb --to html. Submit the HTML file.

  4. The assignment is worth 100 points, and is due on Wednesday, 8th May 2024 at 11:59 pm.

  5. Five points are allocated to properly formatting the assignment. The breakdown is as follows:

  • Must be an HTML file rendered using Quarto (2 pts). If you have a Quarto issue, you must mention the issue & quote the error you get when rendering using Quarto in the comments section of Canvas, and submit the ipynb file. If your issue doesn’t seem genuine, you will lose points.
  • There aren’t excessively long outputs of extraneous information (e.g. no printouts of entire data frames without good reason, there aren’t long printouts of which iteration a loop is on, there aren’t long sections of commented-out code, etc.) (1 pt)
  • Final answers of each question are written in Markdown cells (1 pt).
  • There is no piece of unnecessary / redundant code, and no unnecessary / redundant text (1 pt)

1) Regression Problem - Miami housing

1a) Data preparation

Read the data miami-housing.csv. Check the description of the variables here. Split the data into 60% train and 40% test. Use random_state = 45. The response is SALE_PRC, and the rest of the columns are predictors, except PARCELNO. Print the shape of the predictors dataframe of the train data.

(2 points)
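A minimal sketch of this step, using a small synthetic frame as a stand-in for `miami-housing.csv` (the predictor column names here are illustrative; only `SALE_PRC` and `PARCELNO` are taken from the assignment):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for pd.read_csv("miami-housing.csv"): synthetic predictors,
# an ID column (PARCELNO), and the response (SALE_PRC).
rng = np.random.default_rng(45)
data = pd.DataFrame(rng.normal(size=(100, 3)),
                    columns=["TOT_LVG_AREA", "LND_SQFOOT", "age"])
data["PARCELNO"] = np.arange(100)
data["SALE_PRC"] = 100000 + 50000 * data["TOT_LVG_AREA"] + rng.normal(0, 10000, size=100)

# Drop the response and the ID column from the predictors.
X = data.drop(columns=["SALE_PRC", "PARCELNO"])
y = data["SALE_PRC"]

# 60% train / 40% test with the required random_state.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4,
                                                    random_state=45)
print(X_train.shape)
```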

1b) Decision tree

Develop a decision tree model to predict SALE_PRC based on all the predictors. Use random_state = 45. Use the default hyperparameter values. What is the MAE (mean absolute error) on test data, and the cross-validated MAE?

(3 points)
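The workflow can be sketched as follows, using synthetic regression data in place of the housing data:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in for the Miami housing predictors and response.
X, y = make_regression(n_samples=300, n_features=5, noise=10, random_state=45)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4,
                                                    random_state=45)

tree = DecisionTreeRegressor(random_state=45)  # default hyperparameters
tree.fit(X_train, y_train)

# MAE on the held-out test data.
test_mae = mean_absolute_error(y_test, tree.predict(X_test))

# Cross-validated MAE on the training data; sklearn scorers are
# "higher is better", hence the negation.
cv_mae = -cross_val_score(tree, X_train, y_train,
                          scoring="neg_mean_absolute_error", cv=5).mean()
print(round(test_mae, 1), round(cv_mae, 1))
```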

1c) Tuning decision tree

Tune the hyperparameters of the decision tree model developed in the previous question, and compute the MAE on test data. You must tune the hyperparameters in the following manner:

The cross-validated MAE obtained must be less than $68,000. You must show the optimal values of the hyperparameters obtained, and find the test MAE with the tuned model.

Hint:

  1. BayesSearchCV() may take less than a minute when tuning max_depth and max_features.
  2. You are free to decide which hyperparameters to tune.

(9 points)
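One possible tuning loop, sketched on synthetic data with GridSearchCV (BayesSearchCV from scikit-optimize accepts the same estimator and scoring arguments, with search ranges in place of a grid; the grid below is illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

X, y = make_regression(n_samples=300, n_features=5, noise=10, random_state=45)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4,
                                                    random_state=45)

# Search over the two hyperparameters named in the hint.
grid = {"max_depth": range(3, 16), "max_features": range(1, 6)}
search = GridSearchCV(DecisionTreeRegressor(random_state=45), grid,
                      scoring="neg_mean_absolute_error", cv=5)
search.fit(X_train, y_train)

# Optimal hyperparameters, cross-validated MAE, and test MAE.
print(search.best_params_, round(-search.best_score_, 1))
test_mae = mean_absolute_error(y_test, search.best_estimator_.predict(X_test))
```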

1d) Bagging decision trees

Bag decision trees, and compute the out-of-bag MAE. Use enough trees that the MAE stabilizes. Other than n_estimators, use the default values of the hyperparameters.

The out-of-bag cross-validated MAE must be less than $48,000.

(4 points)
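A sketch of checking OOB stability on synthetic data (the n_estimators values shown are illustrative; increase them until the OOB MAE stops changing appreciably):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=5, noise=10, random_state=45)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4,
                                                    random_state=45)

# The default base estimator of BaggingRegressor is a decision tree.
for n in [50, 100, 200]:
    bag = BaggingRegressor(n_estimators=n, oob_score=True, random_state=45)
    bag.fit(X_train, y_train)
    # oob_prediction_ holds each training point's out-of-bag prediction.
    oob_mae = np.mean(np.abs(y_train - bag.oob_prediction_))
    print(n, round(oob_mae, 1))
```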

1e) Bagging without bootstrapping

Bag decision trees without bootstrapping, i.e., set bootstrap = False while bagging the trees, and compute the cross-validated MAE. Why is the MAE obtained much higher than that in the previous question, but lower than that obtained in 1(b)?

(1 point for code, 3 + 3 points for reasoning)

1f) Bagging without bootstrapping samples, but bootstrapping features

Bag decision trees without bootstrapping samples, but with bootstrapping of features, i.e., set bootstrap = False and bootstrap_features = True while bagging the trees, and compute the cross-validated MAE. Why is the MAE obtained much lower than that in the previous question?

(1 point for code, 3 points for reasoning)
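The two bagging variants from 1(e) and 1(f) can be sketched side by side on synthetic data:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=5, noise=10, random_state=45)

def cv_mae(model):
    # Cross-validated MAE (scorer is negated, so flip the sign).
    return -cross_val_score(model, X, y,
                            scoring="neg_mean_absolute_error", cv=5).mean()

# 1(e): no bootstrapping of rows -- every tree sees the same sample.
no_boot = BaggingRegressor(n_estimators=50, bootstrap=False, random_state=45)

# 1(f): same rows for every tree, but each tree draws a bootstrap
# sample of the features.
feat_boot = BaggingRegressor(n_estimators=50, bootstrap=False,
                             bootstrap_features=True, random_state=45)

mae_no_boot = cv_mae(no_boot)
mae_feat_boot = cv_mae(feat_boot)
print(round(mae_no_boot, 1), round(mae_feat_boot, 1))
```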

1g) Tuning bagged tree model

1g)i) Approaches

There are two approaches for tuning a bagged tree model:

  1. Out of bag prediction

  2. K-fold cross-validation using GridSearchCV.

What is the advantage of each approach over the other, i.e., what is the advantage of the out-of-bag approach over K-fold cross-validation, and what is the advantage of K-fold cross-validation over the out-of-bag approach?

(3 + 3 points)

1g)ii) Tuning the hyperparameters

Tune the hyperparameters of the bagged tree model developed in 1(d). You may use either of the approaches mentioned in the previous question. Show the optimal values of the hyperparameters obtained. Compute the MAE on test data with the tuned model. Your cross-validated MAE must be less than the cross-validated MAE obtained in the previous question.

It is up to you to pick the hyperparameters and their values in the grid.

Hint:

GridSearchCV() may work better than BayesSearchCV() in this case. Why?

(9 points)
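If you take the K-fold route, the search can be sketched as follows on synthetic data (the hyperparameters and grid values are illustrative; pick your own):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=300, n_features=5, noise=10, random_state=45)

# Illustrative grid over the fraction of rows and number of features
# each tree sees.
grid = {"max_samples": [0.5, 0.75, 1.0], "max_features": [3, 4, 5]}
search = GridSearchCV(BaggingRegressor(n_estimators=50, random_state=45),
                      grid, scoring="neg_mean_absolute_error", cv=5)
search.fit(X, y)

print(search.best_params_, round(-search.best_score_, 1))
```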

1h) Random forest

1h)(i) Tuning random forest

Tune a random forest model to predict SALE_PRC, and compute the MAE on test data. The cross-validated MAE must be less than $46,000.

It is up to you to pick the hyperparameters and their values in the grid.

Hint: The OOB approach will take less than a minute.

(9 points)
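The OOB approach from the hint can be sketched on synthetic data: one fit per hyperparameter combination, with no K-fold refitting (the grid values are illustrative):

```python
import numpy as np
from itertools import product
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10, random_state=45)

# One fit per combination; the OOB predictions act as the
# cross-validated predictions.
best = None
for max_features, max_depth in product([2, 3, 4], [10, 20, None]):
    rf = RandomForestRegressor(n_estimators=100, max_features=max_features,
                               max_depth=max_depth, oob_score=True,
                               random_state=45)
    rf.fit(X, y)
    oob_mae = np.mean(np.abs(y - rf.oob_prediction_))
    if best is None or oob_mae < best[0]:
        best = (oob_mae, max_features, max_depth)

print(best)
```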

1h)(ii) Feature importance

Arrange and print the predictors in decreasing order of importance.

(4 points)
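A sketch of the sorting step, assuming a fitted random forest on synthetic data with placeholder predictor names:

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10, random_state=45)
X = pd.DataFrame(X, columns=[f"x{i}" for i in range(5)])  # placeholder names

rf = RandomForestRegressor(n_estimators=100, random_state=45).fit(X, y)

# Pair each predictor with its impurity-based importance, largest first.
importances = pd.Series(rf.feature_importances_,
                        index=X.columns).sort_values(ascending=False)
print(importances)
```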

1h)(iii) Feature selection

Drop the least important predictor, and find the cross-validated MAE of the tuned model again. You may need to adjust the max_features hyperparameter to account for the dropped predictor. Did the cross-validated MAE decrease?

(4 points)

1h)(iv) Random forest vs bagging: max_features

Note that the max_features hyperparameter appears in both the RandomForestRegressor() function and the BaggingRegressor() function. Does it have the same meaning in both functions? If not, what is the difference?

Hint: Check scikit-learn documentation

(1 + 3 points)

2) Classification - Term deposit

The data for this question is related to the direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls, in which bank clients were called and asked to subscribe to a term deposit.

There is a train data - train.csv, which you will use to develop a model. There is a test data - test.csv, which you will use to test your model. Each dataset has the following attributes about the clients called in the marketing campaign:

  1. age: Age of the client

  2. education: Education level of the client

  3. day: Day of the month the call is made

  4. month: Month of the call

  5. y: did the client subscribe to a term deposit?

  6. duration: Call duration, in seconds. This attribute highly affects the output target (e.g., if duration=0 then y=‘no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for inference purposes and should be discarded if the intention is to have a realistic predictive model.

(Raw data source: Source. Do not use the raw data source for this assignment. It is just for reference.)

2a) Data preparation

Convert all the categorical predictors in the data to dummy variables. Note that month and education are categorical variables.

(2 points)
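A sketch of the encoding on a toy frame (the column names come from the assignment; the values are made up). Remember to apply the same encoding to the test data and align its columns with the train data:

```python
import pandas as pd

# Toy stand-in for train.csv with the assignment's categorical columns.
train = pd.DataFrame({
    "age": [30, 41, 25, 52],
    "education": ["primary", "secondary", "tertiary", "secondary"],
    "day": [5, 12, 19, 26],
    "month": ["may", "jun", "may", "jul"],
})

# One-hot encode the categoricals; drop_first avoids a redundant level.
train_enc = pd.get_dummies(train, columns=["education", "month"],
                           drop_first=True)
print(train_enc.columns.tolist())
```

After encoding the test data the same way, `test_enc.reindex(columns=train_enc.columns, fill_value=0)` guards against levels that appear in only one of the two files.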

2b) Random forest

Develop and tune a random forest model to predict the probability of a client subscribing to a term deposit based on age, education, day and month. The model must have:

  1. A minimum overall classification accuracy of 75% on both train.csv and test.csv.

  2. A minimum recall of 60% on both train.csv and test.csv.

Print the accuracy and recall for both the datasets - train.csv, and test.csv.

Note that:

  1. You cannot use duration as a predictor. It is not useful for a realistic predictive model because its value is determined only after the marketing call ends, and by then the client's response is already known.

  2. You are free to choose any value of threshold probability for classifying observations. However, you must use the same threshold on both the datasets.

  3. Use cross-validation on train data to optimize the model hyperparameters.

  4. Using the optimal model hyperparameters obtained in (iii), develop the random forest model. Plot the cross-validated accuracy and recall against the decision threshold probability. Tune the decision threshold probability based on the plot, or the data underlying the plot, to achieve the required trade-off between recall and accuracy.

  5. Evaluate the accuracy and recall of the developed model with the tuned decision threshold probability on both the datasets. Note that the test dataset must only be used to evaluate performance metrics, and not optimize any hyperparameters or decision threshold probability.

(22 points - 8 points for tuning the hyperparameters, 5 points for making the plot, 5 points for tuning the decision threshold probability based on the plot, and 4 points for printing the accuracy & recall on both the datasets)

Hint:

  1. Restrict the search for max_depth to a maximum of 25, and max_leaf_nodes to a maximum of 45. Without this restriction, you may get a better recall for threshold probability = 0.5, but are likely to get a worse trade-off between recall and accuracy. Tune max_features, max_depth, and max_leaf_nodes with OOB cross-validation.

  2. Use oob_decision_function_ for OOB cross-validated probabilities.

It is up to you to pick the hyperparameters and their values in the grid.
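The threshold-sweep in the hints can be sketched as follows on synthetic, imbalanced classification data (hyperparameter values are illustrative; in the assignment you would plot accuracy and recall against the threshold, e.g. with matplotlib, before choosing one):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score

# Imbalanced synthetic stand-in for the term-deposit data (~20% positives).
X, y = make_classification(n_samples=500, weights=[0.8], random_state=45)

rf = RandomForestClassifier(n_estimators=200, max_depth=10, oob_score=True,
                            random_state=45).fit(X, y)

# OOB cross-validated probabilities of the positive class.
oob_prob = rf.oob_decision_function_[:, 1]

# Sweep decision thresholds; accuracy and recall trade off as the
# threshold drops.
for thr in np.arange(0.1, 0.6, 0.1):
    pred = (oob_prob >= thr).astype(int)
    print(round(float(thr), 1),
          round(accuracy_score(y, pred), 3),
          round(recall_score(y, pred), 3))
```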

3) Predictor transformations in trees

Can a non-linear monotonic transformation of predictors (such as log(), sqrt() etc.) be useful in improving the accuracy of decision tree models?

(4 points for answer)